Abstract
Introduction
PacBio Iso-Seq long-read sequencing enables full-length transcript identification, providing high-resolution isoform-level information (Veiga, Nesta et al. 2022). However, its limited sequencing depth poses significant challenges in quantifying transcripts of lowly expressed genes, particularly when transcript abundance falls below 10 TPM and/or transcript length exceeds 2 kb (Pardo-Palacios, Wang et al. 2024). Under these conditions, standard analysis pipelines such as SQANTI3 (Pardo-Palacios, Arzalluz-Luque et al. 2024) may misclassify long reads into distinct transcripts. In contrast, Illumina short-read RNA-seq offers much deeper coverage but cannot resolve transcript isoforms, because short reads rarely span transcript-specific regions (Apostolides, Choi et al. 2024). TAGET (Xia, Jin et al. 2023) splits long reads into short reads to improve mapping sensitivity in isoform classification, but the tool is not optimized for lowly expressed genes and requires accurate transcript reconstruction as a prerequisite. IsoQuant (Prjibelski, Mikheenko et al. 2023), a tool for reference-aware transcript reconstruction and isoform collapsing, can identify novel isoforms but faces similar issues when quantifying isoforms of lowly expressed genes. In this study, we propose a series of actionable approaches that integrate long- and short-read data to quantify lowly expressed transcripts. We applied the resulting analytic procedure to quantify transcripts of the hematologic disease gene SBDS (Costa and Santos 2008).
Methods
The protocol begins with PacBio HiFi circular consensus sequencing reads, which are processed using the standard Iso-Seq3 modules: demultiplexing, adapter and poly(A) trimming, chimera filtering, hierarchical clustering, and alignment to the hg38 human reference genome using pbmm2 (https://github.com/PacificBiosciences/pbmm2), the PacBio wrapper for minimap2 (Li 2018). Isoforms are then reconstructed and collapsed using IsoQuant. To increase quantification sensitivity, we modified the reference transcript model required by TAGET, using the IsoQuant-derived novel transcripts together with the known transcripts as a custom transcript model. The major known transcripts of each sample, as determined from its high-depth short-read data, were incorporated into TAGET so that predicted transcripts are assigned to the correct known isoforms. This step enables TAGET to quantify lowly expressed genes accurately.
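The two reference-building steps above (combining known and novel transcripts, and identifying the major known transcript per gene from short-read data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: it assumes that short-read quantification yields one TPM value per transcript, and the function names, record layout, and identifiers such as `tx_A` are hypothetical placeholders.

```python
from collections import defaultdict


def major_transcript_ratios(tpm_records):
    """Return {gene: (major_transcript, fraction_of_gene_tpm)}.

    tpm_records: iterable of (gene_id, transcript_id, tpm) tuples, assumed
    to come from a short-read quantifier run on the same sample.
    """
    per_gene = defaultdict(dict)
    for gene, transcript, tpm in tpm_records:
        per_gene[gene][transcript] = per_gene[gene].get(transcript, 0.0) + float(tpm)

    result = {}
    for gene, tx_tpm in per_gene.items():
        total = sum(tx_tpm.values())
        if total <= 0:
            continue  # gene not detected in short reads: no major isoform
        major = max(tx_tpm, key=tx_tpm.get)
        result[gene] = (major, tx_tpm[major] / total)
    return result


def merge_transcript_models(known_ids, novel_ids):
    """Union of known and novel transcript IDs, known first, no duplicates."""
    seen = set()
    merged = []
    for tid in list(known_ids) + list(novel_ids):
        if tid not in seen:
            seen.add(tid)
            merged.append(tid)
    return merged


# Toy values for illustration only (not data from this study).
records = [("G1", "tx_A", 30.0), ("G1", "tx_B", 10.0), ("G2", "tx_C", 0.0)]
ratios = major_transcript_ratios(records)
reference = merge_transcript_models(["tx_A", "tx_B"], ["novel_1", "tx_A"])
```

In this sketch a gene with no short-read signal is left without a major isoform rather than being assigned one arbitrarily, so low-confidence genes are not forced onto a known transcript.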
In addition to the integrated quantification of genes and transcripts described above, the workflow provides allele-specific expression (ASE) analysis of mutations. For ASE, long reads containing known or putative variants were extracted using SAMtools (Li, Handsaker et al. 2009), grouped by allele, and assigned to specific transcript isoforms to resolve allelic imbalance at the isoform level. Aligned reads of target genes were converted into GTF format and visualized in CLC Genomics Workbench in its viewing mode, which gives access to basic analysis tools without a license, enabling clear visualization of isoform structures, read support, and allele-specific expression.
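The allele-grouping logic can be illustrated with a short pure-Python sketch. It is a stand-in for the SAMtools-based extraction, not the procedure itself: it assumes each read has already been reduced to a 0-based alignment start, a CIGAR string, a read sequence, and an isoform label, and the function names and the `reads` record layout are hypothetical.

```python
import re
from collections import Counter, defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def base_at(ref_start, cigar, seq, ref_pos):
    """Return the read base aligned to ref_pos (0-based), or None.

    None means the read has no aligned base at that position
    (deletion, spliced intron, or outside the alignment).
    """
    ref = ref_start
    qry = 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":  # consumes reference and query
            if ref <= ref_pos < ref + n:
                return seq[qry + (ref_pos - ref)]
            ref += n
            qry += n
        elif op in "IS":  # consumes query only
            qry += n
        elif op in "DN":  # consumes reference only
            if ref <= ref_pos < ref + n:
                return None
            ref += n
        # H and P consume neither reference nor query
    return None


def allele_counts_by_isoform(reads, ref_pos, ref_base, alt_base):
    """Count reference- and alternate-allele reads per isoform."""
    counts = defaultdict(Counter)
    for r in reads:
        b = base_at(r["start"], r["cigar"], r["seq"], ref_pos)
        if b == ref_base:
            counts[r["isoform"]]["ref"] += 1
        elif b == alt_base:
            counts[r["isoform"]]["alt"] += 1
    return {iso: dict(c) for iso, c in counts.items()}


# Toy reads for illustration only (not data from this study).
reads = [
    {"start": 100, "cigar": "5M", "seq": "ACGTA", "isoform": "iso1"},
    {"start": 100, "cigar": "2M3N3M", "seq": "ACTTT", "isoform": "iso2"},
    {"start": 101, "cigar": "1S4M", "seq": "NATCC", "isoform": "iso1"},
]
counts = allele_counts_by_isoform(reads, ref_pos=102, ref_base="G", alt_base="T")
```

Reads whose alignment skips the variant position, such as a read spliced across it, are excluded from both allele counts, so the per-isoform tallies only reflect reads that directly support one allele.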
Results
We applied our protocol to quantify transcript isoforms of SBDS, a gene implicated in Shwachman-Diamond syndrome (SDS) (Boocock, Morrison et al. 2003), using matched PacBio long-read and Illumina short-read RNA-seq data from human bone marrow stromal cells. Long-read sequencing data were processed with IsoQuant to generate a refined set of novel transcript isoforms. These isoforms, along with known isoforms, were used as the reference. With default settings, SQANTI3, TAGET and IsoQuant mistakenly matched the major predicted transcript to a nonsense-mediated decay transcript of SBDS (ENST00000414306) rather than to the canonical transcript (ENST00000246868). After updating TAGET with the new reference and the major known transcript ratios, we accurately quantified the expression of the major SBDS isoforms at low abundance. ASE analysis of SBDS mutations detected significant allelic imbalance between mutant and wild-type transcripts. Visualization of SBDS transcript structures and read alignments using CLC Genomics Workbench confirmed the presence of multiple isoforms supported by both long- and short-read data.
Conclusions
This integrative protocol enhances the resolution and accuracy of transcript quantification, especially for lowly expressed genes, and provides new insights into allele-specific regulation.